70 research outputs found

A classification approach for detecting cross-lingual biomedical term translations

Author: Bollegala D
Hakami H
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/2017

University of Liverpool Repository

Identifying Argument based Relation Properties in Opinions

Author: Bollegala D
Parsons
Rajendran
Publication date: 16/08/2017

University of Liverpool Repository

Compositional Approaches for Representing Relations Between Words: A Comparative Study

Author: Bollegala D
Hakami
Publication venue: 'Elsevier BV'

Identifying the relations that exist between words (or entities) is important for various natural language processing tasks such as relational search, noun-modifier classification and analogy detection. A popular approach to representing the relations between a pair of words is to extract from a corpus the patterns in which the words co-occur, and to assign each word pair a vector of pattern frequencies. Despite the simplicity of this approach, it suffers from data sparseness, information scalability and linguistic creativity, as the model is unable to handle previously unseen word pairs in a corpus. In contrast, a compositional approach for representing relations between words overcomes these issues by using the attributes of each individual word to indirectly compose a representation for the common relations that hold between the two words. This study aims to compare different operations for creating relation representations from word-level representations. We investigate the performance of the compositional methods by measuring the relational similarities using several benchmark datasets for word analogy. Moreover, we evaluate the different relation representations in a knowledge base completion task.

University of Liverpool Repository

Learning Linear Transformations between Counting-based and Prediction-based Word Embeddings

Author: Bollegala D
Hayashi Kohei
Kawarabayashi Ken-ichi
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 19/09/2017

Despite the growing interest in prediction-based word embedding learning methods, it remains unclear how the vector spaces learnt by the prediction-based methods differ from those of the counting-based methods, or whether one can be transformed into the other. To study the relationship between counting-based and prediction-based embeddings, we propose a method for learning a linear transformation between two given sets of word embeddings. Our proposal contributes to the word embedding learning research in three ways: (a) we propose an efficient method to learn a linear transformation between two sets of word embeddings, (b) using the transformation learnt in (a), we empirically show that it is possible to predict distributed word embeddings for novel unseen words, and (c) we empirically show that it is possible to linearly transform counting-based embeddings to prediction-based embeddings, for frequent words, different POS categories, and varying degrees of ambiguity.

University of Liverpool Repository

Directory of Open Access Journals

A Comparative Study of Pivot Selection Strategies for Unsupervised Domain Adaptation

Author: Al-Bazzas Noor
Bollegala D
Coenen FP
Cui Xia
Publication date: 19/03/2018

University of Liverpool Repository

Evaluating Co-reference Chains based Conversation History in Conversational Question Answering

Author: Bollegala D
Coenen FP
Mandya AA
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/07/2020

This paper examines the effect of using conversation history based on co-reference chains, as opposed to the entire conversation history, for the conversational question answering (CoQA) task. The QANet model is modified to include conversational history, and NeuralCoref is used to obtain the co-reference chain based conversation history. The results of the study indicate that, in spite of the availability of a large proportion of co-reference links in CoQA, the abstract nature of questions in CoQA makes it difficult to obtain a correct mapping of co-reference related conversation history, and thus results in lower performance compared to systems that use the entire conversation history. The effect of co-reference resolution, examined across various domains and conversation lengths, shows that co-reference resolution across questions is helpful for certain domains and medium-length conversations.

University of Liverpool Repository

Crossref

Transductive Learning with String Kernels for Cross-Domain Text Classification

Author: AM Fernández
D Bollegala
G Ifrim
H Lodhi
J Shawe-Taylor
M Franco-Salvador
M Long
Marius Popescu
RT Ionescu
TG Dietterich
Publication date: 02/11/2018

For many text classification tasks, there is a major problem posed by the lack of labeled data in a target domain. Although classifiers for a target domain can be trained on labeled text data from a related source domain, the accuracy of such classifiers is usually lower in the cross-domain setting. Recently, string kernels have obtained state-of-the-art results in various text classification tasks such as native language identification or automatic essay scoring. Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains. In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches to further improve the results of string kernels in cross-domain settings. By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification. Comment: Accepted at ICONIP 2018. arXiv admin note: substantial text overlap with arXiv:1808.0840

arXiv.org e-Print Archive

Crossref

Is something better than nothing? automatically predicting stance-based arguments using deep learning and small labelled dataset

Author: Bollegala D
Parsons S
Rajendran P
Publication date: 01/01/2018

Online reviews have become a popular portal among customers making decisions about purchasing products. A number of corpora of reviews have been widely investigated in NLP in general, and in argument mining in particular. Argument mining is a subset of NLP that deals with extracting arguments and the relations among them from user-based content. A major problem faced by argument mining research is the lack of human-annotated data. In this paper, we investigate the use of weakly supervised and semi-supervised methods for automatically annotating data, and thus providing large annotated datasets. We do this by building on previous work that explores the classification of opinions present in reviews based on whether the stance is expressed explicitly or implicitly. In the work described here, we automatically annotate stance as implicit or explicit, and our results show that the datasets we generate, although noisy, can be used to learn better models for implicit/explicit opinion classification.

University of Liverpool Repository

Crossref

Tick parasitism classification from noisy medical records

Author: Bollegala D
Neill JO
Noble PJ
Radford AD
Publication date: 01/01/2019

Much of the health information in the medical domain comes in the form of clinical narratives. The rich semantic information contained in these notes can be modeled to make inferences that assist the decision-making process for medical practitioners, which is particularly important under time and resource constraints. However, the creation of such assistive tools is made difficult by the ubiquity of misspellings, unsegmented words and morphologically complex or rare medical terms. This reduces the coverage of vocabulary terms present in commonly used pretrained distributed word representations that are passed as input to the parametric models that make such predictions. This paper presents an ensemble architecture that combines in-domain and general word embeddings to overcome these challenges, showing the best performance on a binary classification task when compared to various other baselines. We demonstrate our approach in the context of the veterinary domain for the task of identifying tick parasitism from small animals. The best model shows 84.29% test accuracy, showing some improvement over models that only use pretrained embeddings not specifically trained for the medical sub-domain of interest.

University of Liverpool Repository

Correcting crowdsourced annotations to improve detection of outcome types in evidence based medicine

Author: Abaho M
Bollegala D
Dodd S
Williamson P
Publication date: 01/01/2019

The validity and authenticity of annotations in datasets massively influence the performance of Natural Language Processing (NLP) systems. In other words, poorly annotated datasets are likely to produce fatal results in most NLP problems, hence misinforming consumers of these models, systems or applications. This is a bottleneck in most domains, especially in healthcare, where crowdsourcing is a popular strategy for obtaining annotations. In this paper, we present a framework that automatically corrects incorrectly captured annotations of outcomes, thereby improving the quality of the crowdsourced annotations. We investigate a publicly available dataset called EBM-NLP, built to power NLP tasks in support of Evidence based Medicine (EBM), primarily focusing on health outcomes.

University of Liverpool Repository